RStudio Exercise 4

This week clustering and classification are central in this course.

Create this file and describe the data

library(MASS)
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

In the “Boston”-data set there are 506 observations with the following 14 variables:

  1. crim: per capita crime rate by town.
  2. zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. indus: proportion of non-retail business acres per town.
  4. chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. nox: nitrogen oxides concentration (parts per 10 million).
  6. rm: average number of rooms per dwelling.
  7. age: proportion of owner-occupied units built prior to 1940.
  8. dis: weighted mean of distances to five Boston employment centers.
  9. rad: index of accessibility to radial highways.
  10. tax: full-value property-tax rate per $10,000.
  11. ptratio: pupil-teacher ratio by town.
  12. black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  13. lstat: lower status of the population (percent).
  14. medv: median value of owner-occupied homes in $1000s.

Grafical overview and summary

pairs(Boston)

This plot isn’t very helpful for interpretation of the data, but shows some correlative relationships. A clearer statement about correlations will be made based upon the correlationsplot and -table below.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

In the summary there are som non-parametrically distributed parameters as crim, zn, rad, age, tax, indus and black. Other parameters can be distributed parameterically. In case of doubt a test for normality could be done.

library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.3.4     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()
library(corrplot)
## corrplot 0.84 loaded
cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", type = "upper", cl.pos= "b", tl.pos = "d", tl.cex = 0.6)

There seem to be negative correlations between dis versus age, nox, indus and tax, suggesting that the further from the employment centers the buildings are younger, there is less nitrous oxide pollution ,the proportion of non-retail businesses is lower and the full-value property-tax rate is higher. The correlation between medv and lstat is outspoken negative, meaning that areas with lower social status of the population have cheaper owner-occupied houses and vice versa.

Standarized data set and division into training and test set

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)

The data is scaled to the values above to eliminate the effect of different values on the following analyses. The data is divided into a training and test set.

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins,  labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

Linear discriminant analysis

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2128713 0.2549505 0.2574257 0.2747525 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       1.13560957 -0.8950705 -0.13498836 -0.9352359  0.47757397
## med_low  -0.07158173 -0.2791022 -0.11943197 -0.5722214 -0.12824924
## med_high -0.41220520  0.2075962  0.29552193  0.3957019  0.07519994
## high     -0.48724019  1.0149946 -0.05951284  1.0532891 -0.40638816
##                 age        dis        rad        tax     ptratio
## low      -0.9564770  0.8990340 -0.6840710 -0.7423246 -0.49346475
## med_low  -0.3307019  0.3848772 -0.5548199 -0.4643284 -0.08977988
## med_high  0.4301343 -0.4054097 -0.4297236 -0.3161061 -0.23217627
## high      0.8184563 -0.8466611  1.6596029  1.5294129  0.80577843
##                black       lstat         medv
## low       0.37122281 -0.79211635  0.579907838
## med_low   0.34180240 -0.11395265 -0.005994976
## med_high  0.09312394  0.04811209  0.135168114
## high     -0.71516683  0.89911429 -0.692679359
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.10868350  0.94674567 -1.08492100
## indus    0.08319824  0.03141315  0.18033520
## chas    -0.09470058 -0.09376397 -0.07637680
## nox      0.18396848 -0.72477190 -1.24804868
## rm      -0.15883310 -0.11053784 -0.12511803
## age      0.27711676 -0.30499734 -0.08447689
## dis     -0.11480676 -0.25390972  0.46457047
## rad      3.81403979  1.03225877 -0.15036671
## tax     -0.05592896 -0.24183424  0.83936370
## ptratio  0.15412241  0.02773897 -0.36335076
## black   -0.12508473  0.02104907  0.16123529
## lstat    0.11006864 -0.22070666  0.27374891
## medv     0.13426342 -0.30447971 -0.21538978
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9536 0.0355 0.0109
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit , dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

In this example the LD1-model explains 95 % of the clustering with the index of accessibility to radial highways (rad) as the most influential linear separator.

Predict LDA model on test data

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16      21        4    0
##   med_low    3      16        4    0
##   med_high   0       4       16    2
##   high       0       0        1   15

The model predicts well the true values for the categorical crime rate. Most of the wrong predictions were adressed to the med_low crime group.

k-means algoritm

#reload Boston and scale
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# scaled distance matrix
dist_scaled <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_scaled)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(dist_scaled, centers = 2)

# plot the Boston dataset with clusters

km <-kmeans(dist_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

The optimal number for clusters, according to k-means algorithm is 2. In the correlation plot above, we can see that the plots with strong correlations between the parameters, show for most of the variables a division into two clusters, according to k-means method.

K-means with 4 clusters and LDA

km <-kmeans(dist_scaled, centers = 4)
lda.fit <- lda(km$cluster~., data = boston_scaled)
lda.fit
## Call:
## lda(km$cluster ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.1304348 0.4308300 0.2114625 0.2272727 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm
## 1  1.4330759 -0.4872402  1.0689719  0.4435073  1.3439101 -0.7461469
## 2 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## 3 -0.3912182  1.2671159 -0.8754697  0.5739635 -0.7359091  0.9938426
## 4  0.2797949 -0.4872402  1.1892663 -0.2723291  0.8998296 -0.2770011
##          age        dis        rad        tax     ptratio       black
## 1  0.8575386 -0.9620552  1.2941816  1.2970210  0.42015742 -1.65562038
## 2 -0.3256000  0.3182404 -0.5741127 -0.6240070  0.02986213  0.34248644
## 3 -0.6949417  0.7751031 -0.5965444 -0.6369476 -0.96586616  0.34190729
## 4  0.7716696 -0.7723199  0.9006160  1.0311612  0.60093343 -0.01717546
##        lstat        medv
## 1  1.1930953 -0.81904111
## 2 -0.2813666 -0.01314324
## 3 -0.8200275  1.11919598
## 4  0.6116223 -0.54636549
## 
## Coefficients of linear discriminants:
##                 LD1        LD2         LD3
## crim    -0.18113078 -0.5012256  0.60535205
## zn      -0.43297497 -1.0486194 -0.67406151
## indus   -1.37753200  0.3016928 -1.07034034
## chas     0.04307937 -0.7598229  0.22448239
## nox     -1.04674638 -0.3861005  0.33268952
## rm       0.14912869 -0.1510367 -0.67942589
## age      0.09897424  0.0523110 -0.26285587
## dis     -0.13139210 -0.1593367  0.03487882
## rad     -0.65824136  0.5189795 -0.48145070
## tax     -0.28903561 -0.5773959 -0.10350513
## ptratio -0.22236843  0.1668597  0.09181715
## black    0.42730704  0.5843973 -0.89869354
## lstat   -0.24320629 -0.6197780  0.01119242
## medv    -0.21961575 -0.9485829  0.17065360
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7596 0.1768 0.0636
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 3)

In this example the LD1-model explains only 76 % of the clustering with rad, nox, indus and zn as the most influential linear separators. K-mean algorithm with 4 centers isn’t a better model to cluster this data.

Plotly

model_predictors <- dplyr::select(train, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
#k-means
dist_train <- dist(train)
## Warning in dist(train): NAs introduced by coercion
km <-kmeans(dist_train, centers = 2)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
#Matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library("plotly")
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#Create a 3D plot of the columns of the matrix product
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)

We can see that the high risk and med_high risk crime groups from the LDA-trained set overlap with one cluster according to the k-means algorithm for two cluster. The other cluster, according to k-means algoritm overlaps with low and med_low crime groups from the LDA-trained set.